rperez8013@floridapoly.com
This data visualization project uses the California Housing Prices dataset, drawn from the 1990 census of the State of California.
The objective is to use the tools and principles of data visualization to display and uncover trends, patterns, tendencies, and outliers. Using ggplot2 for R, this report covers:
Data transformation using functions such as filter, select, group_by, and others.
Bar charts, line charts, and others.
Scatter plots, histograms.
Dashboards.
ggplot2 is the plotting library.
The coding language is R.
RStudio is the integrated development environment.
For spatial visualization, the package is sf.
A linear regression model is fitted to the data.
The California Housing Prices dataset contains the median house prices of California blocks from the 1990 census. Here’s a short summary of each variable:
longitude: A measure of how far west a house is; California longitudes are negative, so a more negative value is farther west
latitude: A measure of how far north a house is; a higher value is farther north
housingMedianAge: Median age of a house within a block; a lower number is a newer building
totalRooms: Total number of rooms within a block
totalBedrooms: Total number of bedrooms within a block
population: Total number of people residing within a block
households: Total number of households, a group of people residing within a home unit, for a block
medianIncome: Median income for households within a block of houses (measured in tens of thousands of US Dollars)
medianHouseValue: Median house value for households within a block (measured in US Dollars)
oceanProximity: Location of the house w.r.t ocean/sea
Initially, I was drawn to the dataset of houses in the West Roxbury neighborhood, but when I started analyzing it I found that it lacks the spatial data needed to create a map plot. In addition to the process of:
Gathering the data
Visualizing the data
Plotting the data
Finding correlations within the data
I want to tell the story of how prices change with distance to the ocean; this information is contained in the California Housing Prices dataset.
By taking this class, I have started to use the principles of data visualization as we are learning them:
Application:
Tailor visuals to the audience’s level of data literacy.
Use simple charts (e.g., bar, line, pie) for general audiences, and more complex ones (e.g., heatmaps, scatter plots) for technical viewers.
Application:
Use bar charts for comparisons.
Use line charts for trends over time.
Use map plots to visualize data points on a map.
Use scatter plots for correlation.
Use heatmaps or choropleths for intensity across geography or matrices.
Application:
Eliminate 3D effects, unnecessary gridlines, or distracting colors.
Use white space effectively to draw attention to key insights.
Use data-to-ink ratio thinking—remove any visual element that doesn’t communicate data.
Application:
Use color to highlight key insights (e.g., red for risks, green for success).
Ensure color-blind-friendly palettes.
Use consistent color meaning across visuals (e.g., blue always means “this year”).
Application:
Always include clear axis labels, titles, units, and legends.
Compare values against benchmarks, averages, or goals.
Use annotations or captions to guide interpretation when needed.
Application:
Start axes at zero (when appropriate) to avoid misleading the viewer.
Don’t manipulate scales to exaggerate or hide differences.
Represent proportional data accurately (e.g., don’t make area sizes misleading).
Application:
Use a sequence of visuals to walk viewers through a narrative (e.g., before vs. after, cause vs. effect).
Use titles and subtitles to communicate key takeaways.
Guide the viewer: what do we want them to learn, conclude, or act upon?
A data summary is a concise way to describe and understand the main features of a dataset. It helps to quickly grasp patterns, trends, and outliers without going through the raw data.
## longitude latitude housing_median_age total_rooms
## Min. :-124.3 Min. :32.54 Min. : 1.00 Min. : 2
## 1st Qu.:-121.8 1st Qu.:33.93 1st Qu.:18.00 1st Qu.: 1448
## Median :-118.5 Median :34.26 Median :29.00 Median : 2127
## Mean :-119.6 Mean :35.63 Mean :28.64 Mean : 2636
## 3rd Qu.:-118.0 3rd Qu.:37.71 3rd Qu.:37.00 3rd Qu.: 3148
## Max. :-114.3 Max. :41.95 Max. :52.00 Max. :39320
##
## total_bedrooms population households median_income
## Min. : 1.0 Min. : 3 Min. : 1.0 Min. : 0.4999
## 1st Qu.: 296.0 1st Qu.: 787 1st Qu.: 280.0 1st Qu.: 2.5634
## Median : 435.0 Median : 1166 Median : 409.0 Median : 3.5348
## Mean : 537.9 Mean : 1425 Mean : 499.5 Mean : 3.8707
## 3rd Qu.: 647.0 3rd Qu.: 1725 3rd Qu.: 605.0 3rd Qu.: 4.7432
## Max. :6445.0 Max. :35682 Max. :6082.0 Max. :15.0001
## NA's :207
## ocean_proximity median_house_value
## Length:20640 Min. : 14999
## Class :character 1st Qu.:119600
## Mode :character Median :179700
## Mean :206856
## 3rd Qu.:264725
## Max. :500001
##
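A summary like the one above can be produced with base R's `summary()`. The sketch below is only illustrative: the data frame `toy` and its values are made up (they are not rows from the census file), and it mainly shows how an `NA's` count appears for a column with missing values, as it does for total_bedrooms.

```r
# Toy data frame; the values are invented for illustration only.
toy <- data.frame(
  total_rooms     = c(880, 7099, 1467, 1274, 1627),
  total_bedrooms  = c(129, 1106, NA, 235, 280),
  ocean_proximity = c("NEAR BAY", "NEAR BAY", "INLAND", "INLAND", "ISLAND")
)

# summary() reports min, quartiles, mean and max for numeric columns,
# and adds an NA's entry for columns that contain missing values.
print(summary(toy))

# Count the missing values per column explicitly.
na_counts <- colSums(is.na(toy))
print(na_counts)
```

Checking `na_counts` before modelling helps explain why some rows are later dropped "due to missingness".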
A histogram is a graphical representation of the distribution of a numeric variable. It groups data into intervals (called bins) and shows how many values fall into each bin. Why plot a histogram of the data?
To understand the distribution (normal, skewed, bimodal, etc.)
To spot outliers or clusters
To summarize large datasets
To compare shapes of distributions across groups
It is likely that the ISLAND category of ocean_proximity skews the data because of its high median house values.
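As a minimal sketch of such a histogram with ggplot2, the example below bins simulated, right-skewed values that stand in for median_house_value. The data frame `sim`, the log-normal parameters, and the choice of 30 bins are assumptions made for illustration; they are not taken from the report's own code.

```r
library(ggplot2)

set.seed(1)
# Simulated right-skewed values standing in for median_house_value,
# capped at 500001 to mimic the maximum seen in the data summary.
sim <- data.frame(
  median_house_value = pmin(rlnorm(1000, meanlog = 12, sdlog = 0.5), 500001)
)

# Each bar counts how many values fall into one of 30 equal-width bins.
p <- ggplot(sim, aes(x = median_house_value)) +
  geom_histogram(bins = 30, fill = "steelblue", color = "white") +
  labs(
    title = "Distribution of median house value (simulated)",
    x = "Median house value (USD)",
    y = "Number of blocks"
  )

print(p)
```

Changing `bins` is worth trying: too few bins hide the shape, and too many make the plot noisy.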
Pearson Correlation measures the linear relationship between two variables. Range: from -1 to +1
+1 = perfect positive correlation
0 = no correlation
-1 = perfect negative correlation
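To make the definition concrete, the sketch below computes Pearson's r by hand and compares it with R's built-in `cor()`. The vectors `x` and `y` are invented for illustration.

```r
# Invented example vectors (income in tens of thousands, value in USD).
x <- c(2.1, 3.5, 4.0, 5.2, 8.3)
y <- c(82700, 150000, 210000, 260000, 452600)

# Pearson r: sum of co-deviations divided by the product of the
# square roots of the summed squared deviations.
r_manual <- sum((x - mean(x)) * (y - mean(y))) /
  sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))

# The built-in function should give the same number.
r_builtin <- cor(x, y, method = "pearson")
print(c(manual = r_manual, builtin = r_builtin))

# A perfectly decreasing linear relationship gives r = -1.
r_neg <- cor(x, -2 * x + 1)
print(r_neg)
```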
Does correlation always imply causation? No. Correlation is often the first clue, but to establish causation, we usually need:
Controlled experiments
Longitudinal studies
Strong theoretical support
Ruling out confounders
Still, correlation means two variables are related: when one changes, the other tends to change as well, in a positive or negative direction.
| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value |
|---|---|---|---|---|---|---|---|---|---|
| longitude | 1.0000000 | -0.9246644 | -0.1081968 | 0.0445680 | NA | 0.0997732 | 0.0553101 | -0.0151759 | -0.0459666 |
| latitude | -0.9246644 | 1.0000000 | 0.0111727 | -0.0360996 | NA | -0.1087847 | -0.0710354 | -0.0798091 | -0.1441603 |
| housing_median_age | -0.1081968 | 0.0111727 | 1.0000000 | -0.3612622 | NA | -0.2962442 | -0.3029160 | -0.1190340 | 0.1056234 |
| total_rooms | 0.0445680 | -0.0360996 | -0.3612622 | 1.0000000 | NA | 0.8571260 | 0.9184845 | 0.1980496 | 0.1341531 |
| total_bedrooms | NA | NA | NA | NA | 1 | NA | NA | NA | NA |
| population | 0.0997732 | -0.1087847 | -0.2962442 | 0.8571260 | NA | 1.0000000 | 0.9072223 | 0.0048343 | -0.0246497 |
| households | 0.0553101 | -0.0710354 | -0.3029160 | 0.9184845 | NA | 0.9072223 | 1.0000000 | 0.0130331 | 0.0658427 |
| median_income | -0.0151759 | -0.0798091 | -0.1190340 | 0.1980496 | NA | 0.0048343 | 0.0130331 | 1.0000000 | 0.6880752 |
| median_house_value | -0.0459666 | -0.1441603 | 0.1056234 | 0.1341531 | NA | -0.0246497 | 0.0658427 | 0.6880752 | 1.0000000 |
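The NA entries for total_bedrooms appear because `cor()` by default returns NA for any pair that involves a column with missing values. One possible way around this, sketched below on a made-up data frame `toy`, is to pass `use = "complete.obs"` so that rows with missing values are dropped first. Whether dropping rows is appropriate for the real data is a judgment call that this sketch does not settle.

```r
# Toy data frame with one missing total_bedrooms value (invented numbers).
toy <- data.frame(
  total_rooms    = c(880, 7099, 1467, 1274, 1627, 919),
  total_bedrooms = c(129, 1106, 190, NA, 280, 213),
  households     = c(126, 1138, 177, 219, 259, 193)
)

# Default behaviour: pairs involving a column with NA yield NA.
cor_default <- cor(toy)
print(round(cor_default, 3))

# Drop rows with any missing value before computing the correlations.
cor_complete <- cor(toy, use = "complete.obs")
print(round(cor_complete, 3))
```

`use = "pairwise.complete.obs"` is an alternative that keeps as many rows as possible for each pair of columns.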
As I want to focus on the relationship between the median house value and ocean proximity, I need to transform the data and perform one-hot encoding. One-hot encoding transforms each category value into a new binary (0 or 1) column.
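One-hot encoding as described above can be sketched in base R with `model.matrix()`. The name `enc_data` matches the data frame used in the model call in this report, but how it was actually built is not shown, so this construction is an assumption. The rows of `housing` below are invented for illustration.

```r
# Invented rows covering every ocean_proximity category.
housing <- data.frame(
  median_house_value = c(452600, 358500, 82700, 94600, 450000, 241400, 300000),
  median_income      = c(8.3, 7.3, 2.1, 1.9, 2.7, 3.7, 4.5),
  ocean_proximity    = c("NEAR BAY", "NEAR BAY", "INLAND", "INLAND",
                         "ISLAND", "<1H OCEAN", "NEAR OCEAN")
)

# Fix the level order so "<1H OCEAN" is the reference (dropped) category.
housing$ocean_proximity <- factor(
  housing$ocean_proximity,
  levels = c("<1H OCEAN", "INLAND", "ISLAND", "NEAR BAY", "NEAR OCEAN")
)

# model.matrix() builds one 0/1 column per non-reference level;
# the first column is the intercept, which is removed here.
dummies <- model.matrix(~ ocean_proximity, data = housing)[, -1, drop = FALSE]

# Combine the numeric columns with the dummy columns.
# data.frame() turns spaces in column names into dots (NEAR BAY -> NEAR.BAY).
numeric_part <- housing[, setdiff(names(housing), "ocean_proximity"), drop = FALSE]
enc_data <- data.frame(numeric_part, dummies)
print(names(enc_data))
```

Dropping one category avoids perfectly collinear dummy columns in a regression with an intercept; the effect of each remaining category is then read relative to the dropped one.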
## longitude latitude housing_median_age
## longitude 1.000000000 -0.92466443 -0.10819681
## latitude -0.924664434 1.00000000 0.01117267
## housing_median_age -0.108196813 0.01117267 1.00000000
## total_rooms 0.044567978 -0.03609960 -0.36126220
## total_bedrooms NA NA NA
## population 0.099773223 -0.10878475 -0.29624424
## households 0.055310093 -0.07103543 -0.30291601
## median_income -0.015175865 -0.07980913 -0.11903399
## median_house_value -0.045966615 -0.14416028 0.10562341
## ocean_proximityINLAND -0.055574654 0.35116598 -0.23664459
## ocean_proximityISLAND 0.009445503 -0.01657165 0.01701984
## ocean_proximityNEAR.BAY -0.474488910 0.35877099 0.25517166
## ocean_proximityNEAR.OCEAN 0.045508838 -0.16081792 0.02162156
## total_rooms total_bedrooms population households
## longitude 0.044567978 NA 0.099773223 0.055310093
## latitude -0.036099596 NA -0.108784747 -0.071035433
## housing_median_age -0.361262201 NA -0.296244240 -0.302916009
## total_rooms 1.000000000 NA 0.857125973 0.918484493
## total_bedrooms NA 1 NA NA
## population 0.857125973 NA 1.000000000 0.907222266
## households 0.918484493 NA 0.907222266 1.000000000
## median_income 0.198049645 NA 0.004834346 0.013033052
## median_house_value 0.134153114 NA -0.024649679 0.065842651
## ocean_proximityINLAND 0.025624325 NA -0.020732123 -0.039402469
## ocean_proximityISLAND -0.007571767 NA -0.010412114 -0.009077005
## ocean_proximityNEAR.BAY -0.023022417 NA -0.060880154 -0.010093339
## ocean_proximityNEAR.OCEAN -0.009175150 NA -0.024263727 0.001714434
## median_income median_house_value
## longitude -0.015175865 -0.04596662
## latitude -0.079809127 -0.14416028
## housing_median_age -0.119033990 0.10562341
## total_rooms 0.198049645 0.13415311
## total_bedrooms NA NA
## population 0.004834346 -0.02464968
## households 0.013033052 0.06584265
## median_income 1.000000000 0.68807521
## median_house_value 0.688075208 1.00000000
## ocean_proximityINLAND -0.237495762 -0.48485933
## ocean_proximityISLAND -0.009228171 0.02341608
## ocean_proximityNEAR.BAY 0.056196803 0.16028448
## ocean_proximityNEAR.OCEAN 0.027343611 0.14186217
## ocean_proximityINLAND ocean_proximityISLAND
## longitude -0.05557465 0.009445503
## latitude 0.35116598 -0.016571648
## housing_median_age -0.23664459 0.017019840
## total_rooms 0.02562432 -0.007571767
## total_bedrooms NA NA
## population -0.02073212 -0.010412114
## households -0.03940247 -0.009077005
## median_income -0.23749576 -0.009228171
## median_house_value -0.48485933 0.023416076
## ocean_proximityINLAND 1.00000000 -0.010614425
## ocean_proximityISLAND -0.01061443 1.000000000
## ocean_proximityNEAR.BAY -0.24088703 -0.005498984
## ocean_proximityNEAR.OCEAN -0.26216349 -0.005984684
## ocean_proximityNEAR.BAY ocean_proximityNEAR.OCEAN
## longitude -0.474488910 0.045508838
## latitude 0.358770991 -0.160817925
## housing_median_age 0.255171663 0.021621556
## total_rooms -0.023022417 -0.009175150
## total_bedrooms NA NA
## population -0.060880154 -0.024263727
## households -0.010093339 0.001714434
## median_income 0.056196803 0.027343611
## median_house_value 0.160284484 0.141862170
## ocean_proximityINLAND -0.240887033 -0.262163488
## ocean_proximityISLAND -0.005498984 -0.005984684
## ocean_proximityNEAR.BAY 1.000000000 -0.135818271
## ocean_proximityNEAR.OCEAN -0.135818271 1.000000000
##
## Call:
## lm(formula = median_house_value ~ ., data = enc_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -556980 -42683 -10497 28765 779052
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.270e+06 8.801e+04 -25.791 < 2e-16 ***
## longitude -2.681e+04 1.020e+03 -26.296 < 2e-16 ***
## latitude -2.548e+04 1.005e+03 -25.363 < 2e-16 ***
## housing_median_age 1.073e+03 4.389e+01 24.439 < 2e-16 ***
## total_rooms -6.193e+00 7.915e-01 -7.825 5.32e-15 ***
## total_bedrooms 1.006e+02 6.869e+00 14.640 < 2e-16 ***
## population -3.797e+01 1.076e+00 -35.282 < 2e-16 ***
## households 4.962e+01 7.451e+00 6.659 2.83e-11 ***
## median_income 3.926e+04 3.380e+02 116.151 < 2e-16 ***
## ocean_proximityINLAND -3.928e+04 1.744e+03 -22.522 < 2e-16 ***
## ocean_proximityISLAND 1.529e+05 3.074e+04 4.974 6.62e-07 ***
## ocean_proximityNEAR.BAY -3.954e+03 1.913e+03 -2.067 0.03879 *
## ocean_proximityNEAR.OCEAN 4.278e+03 1.570e+03 2.726 0.00642 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 68660 on 20420 degrees of freedom
## (207 observations deleted due to missingness)
## Multiple R-squared: 0.6465, Adjusted R-squared: 0.6463
## F-statistic: 3112 on 12 and 20420 DF, p-value: < 2.2e-16
The standard error measures the precision of a sample statistic (like the mean or a regression coefficient). It tells you how much the estimate is expected to vary if you repeated the sampling process many times.
A small SE means the estimate is more precise.
A large SE means the estimate is more variable (less reliable).
SE is used to compute confidence intervals and t-values.
A t-value (or t-statistic) is a number that comes from a t-test, which is a statistical method used to compare means and determine if differences are statistically significant. The t-value measures how far your sample statistic (like a sample mean) is from the null hypothesis value, in units of standard error.
A larger t-value (positive or negative) means your result is further from the null hypothesis.
A t-value near 0 means your sample mean is close to the null hypothesis mean.
You compare the t-value to a critical value (based on degrees of freedom and chosen confidence level) to decide if the result is statistically significant.
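For a regression coefficient, the t-value is the estimate divided by its standard error. The sketch below reproduces this for median_income using the rounded estimate, standard error, and residual degrees of freedom printed in the regression output above; the variable names are illustrative, and the 95% confidence level is an assumption.

```r
# Values copied (rounded) from the regression output for median_income.
estimate  <- 3.926e+04
std_error <- 3.380e+02
df_resid  <- 20420

# t-value: distance of the estimate from 0 in units of standard error.
t_value <- estimate / std_error
print(t_value)

# Two-sided p-value under the null hypothesis that the coefficient is 0.
p_value <- 2 * pt(-abs(t_value), df = df_resid)
print(p_value)

# 95% confidence interval: estimate +/- critical value * standard error.
t_crit <- qt(0.975, df = df_resid)
ci <- estimate + c(-1, 1) * t_crit * std_error
print(ci)
```

Because the inputs are rounded, the computed t-value matches the reported 116.151 only to about two decimal places.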
R-squared (also written as R^2) is a statistical measure that tells you how well your regression model fits the data. Specifically, it represents the proportion of the variance in the dependent variable that is explained by the independent variable(s).
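The definition can be checked numerically: R-squared equals one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below uses simulated data (the coefficients and noise level are arbitrary assumptions, not estimates from the census data) and compares the manual value with the one reported by `summary()`.

```r
set.seed(42)
n <- 200

# Simulated predictor and response; the numbers are arbitrary.
median_income <- runif(n, 0.5, 15)
median_house_value <- 40000 * median_income + rnorm(n, sd = 70000)

fit <- lm(median_house_value ~ median_income)

# Residual sum of squares: variation the model leaves unexplained.
rss <- sum(residuals(fit)^2)
# Total sum of squares: variation of the response around its mean.
tss <- sum((median_house_value - mean(median_house_value))^2)

r2_manual <- 1 - rss / tss
r2_model  <- summary(fit)$r.squared
print(c(manual = r2_manual, model = r2_model))
```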
Summary:
The model fits reasonably well (R² ≈ 0.65).
Most variables are statistically significant.
median_income is the strongest positive predictor.
Location features (longitude, latitude, ocean_proximity) are very important.
Population and housing structure (rooms, households) affect value but may be entangled in multicollinearity (see note 1).
Very Important Predictors
| Predictor | Estimate | t-value | Observations |
|---|---|---|---|
| median_income | +39,260 | 116.2 | Strongest positive effect on house value. More income = higher house value. |
| population | −37.97 | −35.3 | Larger populations are associated with lower house values. |
| longitude | −26,810 | −26.3 | Holding the other predictors fixed, a more eastern location (longitude less negative) = lower value. |
| latitude | −25,480 | −25.4 | More northern location = lower value. (Suggests high-value areas are clustered in southern California.) |
| housing_median_age | +1,073 | +24.4 | Older homes tend to be more valuable. |
| ocean_proximityINLAND | −39,280 | −22.5 | Inland properties are much cheaper compared to the reference category (<1H OCEAN). |
Important Predictors
| Predictor | Estimate | t-value | Observations |
|---|---|---|---|
| total_bedrooms | +100.6 | +14.6 | More bedrooms = higher value, but likely correlated with income or household size. |
| total_rooms | −6.19 | −7.83 | Surprisingly negative, may indicate multicollinearity (e.g., with households or bedrooms). |
| households | +49.6 | +6.66 | More households = higher median value (urban/suburban effect). |
| ocean_proximityISLAND | +152,900 | +4.97 | Island properties are significantly more valuable. |
Less Important (still statistically significant)
| Predictor | Estimate | t-value | Observations |
|---|---|---|---|
| ocean_proximityNEAR.OCEAN | +4,278 | +2.73 | Small positive impact on value. |
| ocean_proximityNEAR.BAY | −3,954 | −2.07 | Small negative effect (barely significant). |
Next Steps
Plot absolute t-values or standardized coefficients.
Use stepwise selection, Lasso regression, or random forest to compare and confirm variable impact.
Check for multicollinearity (e.g., using VIF scores) to see if some variables are redundant.
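As a rough sketch of the VIF check, the variance inflation factor of a predictor can be computed by hand as 1 / (1 - R²), where R² comes from regressing that predictor on all the other predictors. The data below are simulated so that total_rooms is strongly tied to households; all numbers are assumptions for illustration. The `vif()` function in the car package is a commonly used alternative to this manual version.

```r
set.seed(7)
n <- 300

# Simulated predictors: total_rooms is built from households,
# while median_income is generated independently.
households    <- rnorm(n, mean = 500, sd = 150)
total_rooms   <- 5 * households + rnorm(n, sd = 100)
median_income <- rnorm(n, mean = 3.9, sd = 1.5)
predictors <- data.frame(households, total_rooms, median_income)

# VIF_j = 1 / (1 - R2_j), with R2_j from regressing predictor j
# on all remaining predictors.
vif_manual <- sapply(names(predictors), function(v) {
  others <- setdiff(names(predictors), v)
  r2 <- summary(lm(reformulate(others, response = v), data = predictors))$r.squared
  1 / (1 - r2)
})
print(round(vif_manual, 2))
```

Values near 1 indicate little overlap with the other predictors; large values flag predictors that carry largely redundant information.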
1. Multicollinearity happens when two or more predictor variables in a regression model are highly correlated with each other. This means they contain overlapping information, which makes it hard for the model to determine which variable is actually influencing the outcome.